Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding

Benhammou, Yassir, Kalyan, Suman, Kumar, Sujay

arXiv.org Artificial Intelligence

Broadcast and media organizations increasingly rely on artificial intelligence to automate the labor-intensive processes of content indexing, tagging, and metadata generation. However, existing AI systems typically operate on a single modality (such as video, audio, or text), limiting their understanding of complex, cross-modal relationships in broadcast material. In this work, we propose a Multimodal Autoencoder (MMAE) that learns unified representations across text, audio, and visual data, enabling end-to-end automation of metadata extraction and semantic clustering. The model is trained on the recently introduced LUMA dataset, a fully aligned benchmark of multimodal triplets representative of real-world media content. By minimizing joint reconstruction losses across modalities, the MMAE discovers modality-invariant semantic structures without relying on large paired or contrastive datasets. We demonstrate significant improvements in clustering and alignment metrics (Silhouette, ARI, NMI) compared to linear baselines, indicating that reconstruction-based multimodal embeddings can serve as a foundation for scalable metadata generation and cross-modal retrieval in broadcast archives. These results highlight the potential of reconstruction-driven multimodal learning to enhance automation, searchability, and content management efficiency in modern broadcast workflows.
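The joint reconstruction objective the abstract describes can be sketched in toy form. The sketch below uses random linear encoders/decoders and a simple averaged fused latent purely for illustration; the dimensions, fusion rule, and linear maps are assumptions, not the paper's actual MMAE architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: three modalities with different feature sizes, one shared latent.
D_TEXT, D_AUDIO, D_VIS, D_LATENT = 16, 24, 32, 8

# Hypothetical linear encoders/decoders standing in for the MMAE's networks.
enc = {m: rng.normal(scale=0.1, size=(d, D_LATENT))
       for m, d in [("text", D_TEXT), ("audio", D_AUDIO), ("visual", D_VIS)]}
dec = {m: rng.normal(scale=0.1, size=(D_LATENT, d))
       for m, d in [("text", D_TEXT), ("audio", D_AUDIO), ("visual", D_VIS)]}

def joint_reconstruction_loss(batch):
    """Encode each modality, fuse into one shared latent (here by averaging),
    decode every modality from it, and sum per-modality MSE losses."""
    z = np.mean([batch[m] @ enc[m] for m in batch], axis=0)  # fused latent
    loss = 0.0
    for m, x in batch.items():
        x_hat = z @ dec[m]
        loss += np.mean((x - x_hat) ** 2)
    return loss

# One aligned triplet batch (8 samples per modality), as in LUMA-style data.
batch = {"text": rng.normal(size=(8, D_TEXT)),
         "audio": rng.normal(size=(8, D_AUDIO)),
         "visual": rng.normal(size=(8, D_VIS))}
loss = joint_reconstruction_loss(batch)
```

Minimizing such a joint loss pushes the shared latent toward information that lets every modality be reconstructed, which is the intuition behind modality-invariant structure emerging without contrastive pairs.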


"Monuments," Reviewed: The Confederacy Surrenders to a Truer American Past

The New Yorker

As the Trump Administration tries to rescue symbols of the Lost Cause, an exhibition in Los Angeles, led by Kara Walker, finds meaning in their desecration. Kara Walker's "Unmanned Drone" (2023) transforms a Stonewall Jackson statue. The first thing you see is a horse's ass, protruding, upside down, from the thorax of a monster. A man's arm descends from the beast's stomach, his gloved hand clutching the blade of a fallen sabre. Every part of the work comes from a statue of the Confederate general Stonewall Jackson that was removed from Charlottesville, Virginia, in 2021.


Podcasts as a Medium for Participation in Collective Action: A Case Study of Black Lives Matter

Moldovan, Theodora, Pera, Arianna, Vega, Davide, Aiello, Luca Maria

arXiv.org Artificial Intelligence

We study how participation in collective action is articulated in podcast discussions, using the Black Lives Matter (BLM) movement as a case study. While research on collective action discourse has primarily focused on text-based content, this study takes a first step toward analyzing audio formats by using podcast transcripts. Using the Structured Podcast Research Corpus (SPoRC), we investigated spoken language expressions of participation in collective action, categorized as problem-solution, call-to-action, intention, and execution. We identified podcast episodes discussing racial justice after important BLM-related events in May and June of 2020, and extracted participatory statements using a layered framework adapted from prior work on social media. We examined the emotional dimensions of these statements, detecting eight key emotions and their association with varying stages of activism. We found that emotional profiles vary by stage, with different positive emotions standing out during calls-to-action, intention, and execution. We detected negative associations between collective action and negative emotions, contrary to theoretical expectations. Our work contributes to a better understanding of how activism is expressed in spoken digital discourse and how emotional framing may depend on the format of the discussion.
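The stage-wise emotion analysis described above can be illustrated with a minimal aggregation sketch. The statements, stage labels, and emotion scores below are invented placeholders, not SPoRC data; the stage names follow the abstract's participation categories.

```python
from collections import defaultdict

# Hypothetical labeled statements: (participation stage, {emotion: score}).
statements = [
    ("call-to-action", {"joy": 0.6, "anger": 0.1}),
    ("call-to-action", {"joy": 0.4, "anger": 0.3}),
    ("intention",      {"joy": 0.2, "anger": 0.2}),
    ("execution",      {"joy": 0.5, "anger": 0.0}),
]

def emotion_profile_by_stage(items):
    """Average each emotion's score within every participation stage."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for stage, emotions in items:
        counts[stage] += 1
        for emo, score in emotions.items():
            sums[stage][emo] += score
    return {stage: {emo: s / counts[stage] for emo, s in emos.items()}
            for stage, emos in sums.items()}

profiles = emotion_profile_by_stage(statements)
```

Comparing the resulting per-stage profiles is the kind of analysis that surfaces the paper's finding that different positive emotions stand out at different stages of activism.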


Towards Unraveling and Improving Generalization in World Models

Fang, Qiaoyi, Du, Weiyu, Wang, Hang, Zhang, Junshan

arXiv.org Artificial Intelligence

World models have recently emerged as a promising approach to reinforcement learning (RL), achieving state-of-the-art performance across a wide range of visual control tasks. This work aims to obtain a deep understanding of the robustness and generalization capabilities of world models. Thus motivated, we develop a stochastic differential equation formulation by treating the world model learning as a stochastic dynamical system, and characterize the impact of latent representation errors on robustness and generalization, for both cases with zero-drift representation errors and with non-zero-drift representation errors. Our somewhat surprising findings, based on both theoretical and experimental studies, reveal that for the case with zero drift, modest latent representation errors can in fact function as implicit regularization and hence result in improved robustness. We further propose a Jacobian regularization scheme to mitigate the compounding error propagation effects of non-zero drift, thereby enhancing training stability and robustness. Our experimental studies corroborate that this regularization approach not only stabilizes training but also accelerates convergence and improves accuracy of long-horizon prediction.
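A Jacobian regularization term of the kind proposed here can be approximated with finite differences. The toy tanh dynamics and Frobenius-norm penalty below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 6
W = rng.normal(scale=0.5, size=(D, D))

def dynamics(z):
    """Toy latent transition standing in for a learned world model."""
    return np.tanh(z @ W)

def jacobian_frobenius_sq(f, z, eps=1e-5):
    """Finite-difference estimate of ||J_f(z)||_F^2. Penalizing this
    quantity damps how latent errors compound across rollout steps."""
    base = f(z)
    total = 0.0
    for i in range(z.shape[0]):
        dz = np.zeros_like(z)
        dz[i] = eps
        total += np.sum(((f(z + dz) - base) / eps) ** 2)
    return total

z0 = rng.normal(size=D)
penalty = jacobian_frobenius_sq(dynamics, z0)
# A training loss would add lambda * penalty to the prediction objective.
```

The intuition: if the transition map's Jacobian has large norm, small representation errors grow multiplicatively over a long horizon, so bounding it stabilizes long-horizon prediction.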



The Download: the problem with police bodycams, and how to make useful robots

MIT Technology Review

When police departments first started buying and deploying bodycams in the wake of the police killing of Michael Brown in Ferguson, Missouri, a decade ago, activists hoped it would bring about real change. Years later, despite what's become a multibillion-dollar market for these devices, the tech is far from a panacea. Most of the vast reams of footage they generate go unwatched. And if they do finally provide video to the public, it's often selectively edited, lacking context and failing to tell the complete story. A handful of AI startups see this problem as an opportunity to create what are essentially bodycam-to-text programs for different players in the legal system, mining this footage for misdeeds.


Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models

Zhang, Xinyu, Hofstätter, Sebastian, Lewis, Patrick, Tang, Raphael, Lin, Jimmy

arXiv.org Artificial Intelligence

Listwise rerankers based on large language models (LLMs) are the zero-shot state of the art. However, current work in this direction all depends on GPT models, making them a single point of failure for scientific reproducibility. This also raises the concern that current research findings hold only for GPT models and not for LLMs in general. In this work, we remove this precondition and build, for the first time, effective listwise rerankers without any form of dependency on GPT. Our passage retrieval experiments show that our best listwise reranker surpasses listwise rerankers based on GPT-3.5 by 13% and achieves 97% of the effectiveness of those built on GPT-4. Our results also show that existing training datasets, which were expressly constructed for pointwise ranking, are insufficient for building such listwise rerankers. Instead, high-quality listwise ranking data is required and crucial, calling for further work on building human-annotated listwise data resources.
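A listwise reranking setup of this kind typically formats all candidate passages into one prompt and parses a ranked permutation back out of the model's reply. The prompt wording and the "[2] > [1] > [3]" output format below are assumptions for illustration, not the paper's exact templates.

```python
import re

def build_listwise_prompt(query, passages):
    """Format a query and its candidate passages into a single listwise
    ranking prompt; the model is asked to emit a permutation of ids."""
    lines = [f"Rank the passages by relevance to the query: {query}"]
    for i, p in enumerate(passages, start=1):
        lines.append(f"[{i}] {p}")
    lines.append("Output the ranking as identifiers, most relevant first.")
    return "\n".join(lines)

def parse_ranking(output, num_passages):
    """Parse '[2] > [1] > [3]'-style output into 0-based indices,
    dropping duplicates and out-of-range ids the model might emit."""
    seen, order = set(), []
    for tok in re.findall(r"\[(\d+)\]", output):
        idx = int(tok) - 1
        if 0 <= idx < num_passages and idx not in seen:
            seen.add(idx)
            order.append(idx)
    return order

prompt = build_listwise_prompt("who won?", ["p1", "p2", "p3"])
order = parse_ranking("[2] > [1] > [3]", 3)
```

Robust parsing matters here because open-source models follow output-format instructions less reliably than GPT models, which is part of what makes GPT-free listwise reranking nontrivial.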


Generate rather than Retrieve: Large Language Models are Strong Context Generators

Yu, Wenhao, Iter, Dan, Wang, Shuohang, Xu, Yichong, Ju, Mingxuan, Sanyal, Soumya, Zhu, Chenguang, Zeng, Michael, Jiang, Meng

arXiv.org Artificial Intelligence

Knowledge-intensive tasks, such as open-domain question answering (QA), require access to a large amount of world or domain knowledge. A common approach for knowledge-intensive tasks is to employ a retrieve-then-read pipeline that first retrieves a handful of relevant contextual documents from an external corpus such as Wikipedia and then predicts an answer conditioned on the retrieved documents. In this paper, we present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators. We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer. Furthermore, we propose a novel clustering-based prompting method that selects distinct prompts, resulting in generated documents that cover different perspectives, leading to better recall over acceptable answers. We conduct extensive experiments on three different knowledge-intensive tasks, including open-domain QA, fact checking, and dialogue systems. Notably, GenRead achieves 71.6 and 54.4 exact match scores on TriviaQA and WebQ, significantly outperforming the state-of-the-art retrieve-then-read pipeline DPR-FiD by +4.0 and +3.9, without retrieving any documents from any external knowledge source. Lastly, we demonstrate that model performance can be further improved by combining retrieval and generation. Our code and generated documents can be found at https://github.com/wyu97/GenRead.
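The generate-then-read pipeline and the idea of selecting diverse prompts can be sketched as follows. Note the substitutions: the paper uses clustering-based prompt selection, while this sketch uses a simpler greedy farthest-point heuristic with the same diversity goal, and `llm_generate`/`llm_read` are hypothetical stubs standing in for real LLM calls.

```python
import numpy as np

rng = np.random.default_rng(2)

def select_diverse_prompts(embeddings, k):
    """Greedy farthest-point selection: pick k demonstration prompts that
    are mutually far apart in embedding space, so the generated documents
    cover different perspectives. (GenRead proper clusters embeddings and
    picks one prompt per cluster; this is a simpler stand-in.)"""
    chosen = [0]  # start from the first candidate
    while len(chosen) < k:
        dists = np.min(
            [((embeddings - embeddings[c]) ** 2).sum(axis=1) for c in chosen],
            axis=0)
        chosen.append(int(np.argmax(dists)))
    return chosen

def generate_then_read(question, llm_generate, llm_read, prompts):
    """Generate-then-read skeleton: prompt the model once per selected
    prompt to write a context document, then read all documents."""
    docs = [llm_generate(p, question) for p in prompts]
    return llm_read(question, docs)

# Hypothetical embeddings for 10 candidate demonstration prompts.
emb = rng.normal(size=(10, 4))
idx = select_diverse_prompts(emb, k=3)
answer = generate_then_read(
    "who wrote Hamlet?",
    llm_generate=lambda p, q: f"doc for {q}",
    llm_read=lambda q, docs: f"answer from {len(docs)} docs",
    prompts=[emb[i] for i in idx])
```

The key design point is that recall comes from diversity among the generated documents rather than from a retrieval index, which is why distinct prompts matter.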


Should Local Police Departments Deploy Lethal Robots?

The New Yorker

Last month, the San Francisco Board of Supervisors voted in favor of allowing that city's police department to deploy robots equipped with a potential to kill, should a situation--in the estimation of police officers--call for lethal force. With that decision, the board appeared to have delivered the city to a dystopian future. The vote garnered a loudly negative response from the public, and this week the supervisors reversed course and sent the policy back to committee. But the fact that the decision initially passed--and may yet pass in some form--should not have been surprising. Police departments around the country have been acquiring robotic devices for decades.


San Francisco will allow police to deploy robots that kill

#artificialintelligence

Supervisors in San Francisco voted Tuesday to give city police the ability to use potentially lethal, remote-controlled robots in emergency situations -- following an emotionally charged debate that reflected divisions on the politically liberal board over support for law enforcement. The vote was 8-3, with the majority agreeing to grant police the option despite strong objections from civil liberties and other police oversight groups. Opponents said the authority would lead to the further militarization of a police force already too aggressive with poor and minority communities. Supervisor Connie Chan, a member of the committee that forwarded the proposal to the full board, said she understood concerns over use of force but that "according to state law, we are required to approve the use of these equipments. So here we are, and it's definitely not a easy discussion."